Skip to content

Python: Clean up LocalSourceNode charpred#6204

Merged
RasmusWL merged 2 commits intogithub:mainfrom
tausbn:python-ensmallen-localsourcenode
Jul 13, 2021
Merged

Python: Clean up LocalSourceNode charpred#6204
RasmusWL merged 2 commits intogithub:mainfrom
tausbn:python-ensmallen-localsourcenode

Conversation

@tausbn
Copy link
Copy Markdown
Contributor

@tausbn tausbn commented Jul 1, 2021

This should be a straight-up improvement where performance is concerned, though in practice many of the removed nodes did not have any flow to begin with.

I'll be running a detailed evaluation to see if it actually improves performance noticeably. (Overall it seems promising: https://github.com/dsp-testing/tausbn-dca/issues/42)

tausbn added 2 commits July 1, 2021 15:02
This gets rid of a large number of nodes that seemingly have no impact.
This results in the same set of nodes, but is a bit more clear about
the reasons why. For instance, `ModuleVariableNode`s are included
directly, and not in a roundabout way by virtue of not having flow to
them. This should hopefully be a bit more robust as well.
@tausbn tausbn added Awaiting evaluation Do not merge yet, this PR is waiting for an evaluation to finish no-change-note-required This PR does not need a change note labels Jul 1, 2021
@tausbn tausbn requested a review from a team as a code owner July 1, 2021 19:16
@github-actions github-actions Bot added the Python label Jul 1, 2021
@tausbn
Copy link
Copy Markdown
Contributor Author

tausbn commented Jul 2, 2021

Unsurprisingly, this seems to be a bigger improvement on bigger databases. Detailed results here

@tausbn tausbn removed the Awaiting evaluation Do not merge yet, this PR is waiting for an evaluation to finish label Jul 2, 2021
Copy link
Copy Markdown
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 👍

There is a little thing around ModuleVariableNode that I would like to understand better though.

Comment on lines +39 to 45
// Module variable nodes must be local source nodes, otherwise type trackers cannot step through
// them.
this instanceof ModuleVariableNode
or
// We explicitly include any read of a global variable, as some of these may have local flow going
// into them.
this = any(ModuleVariableNode mvn).getARead()
Copy link
Copy Markdown
Member

@RasmusWL RasmusWL Jul 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding these comments, makes it much easier 👍

Are there any instances where having ModuleVariableNode provides any value instead of just having any(ModuleVariableNode mvn).getARead()? That is, could we write this part as:

    // Module variable nodes must be local source nodes, otherwise type trackers cannot
    // step through them. We explicitly include any read of a global variable, as some
    // of these may have local flow going into them.
    this = any(ModuleVariableNode mvn).getARead()

I tried running the following query on a few databases I had lying around (salt, nova, peewee), but it did not give any results.

import python
import semmle.python.dataflow.new.internal.LocalSources
import semmle.python.dataflow.new.internal.DataFlowPublic
private import semmle.python.dataflow.new.internal.DataFlowPrivate

from ModuleVariableNode mvn, Node flowsToNode
where
  not mvn = flowsToNode and
  mvn.(LocalSourceNode).flowsTo(flowsToNode) and
  not mvn.getARead().(LocalSourceNode).flowsTo(flowsToNode)
select mvn, flowsToNode

I guess I could just try to run our tests with that change, so I've just started doing that 😛

Copy link
Copy Markdown
Member

@RasmusWL RasmusWL Jul 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tried out running the tests, and some of them failed. For example

x = m4.blah4 #$ use=moduleImport("a4").getMember("b4").getMember("c4").getMember("blah4")

and the query above doesn't give any results either, so clearly that query is not good enough 😓

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trouble arises when you want to type track from some_global = tracked to sink(some_global). In practice this happens in two steps: first we go from tracked to ModuleVariableNode for some_global and then from this node to the read.

However, since type trackers (in their usual formulation) step from LocalSourceNode to LocalSourceNode, we'll lose the first step unless the ModuleVariableNode is also a LocalSourceNode.

On the other hand, we should eventually be able to get rid of the getARead bit. It's only there right now because we have local flow into global variables, which I think we should get rid of (but we do want it in the global scope, so this is slightly tricky).

For instance we currently have local flow from source to sink in the following bit of code:

def make_safe():
    global g
    g = NON_SOURCE

def foo():
    global g
    g = SOURCE
    make_safe()
    sink(g)

Now, we'll actually have this regardless (unless we put some effort into tracking what globals are modified by what functions, which we might), since there will be the aforementioned step through the module variable node. But we shouldn't have local flow here.

I think based on this discussion that I should maybe expand the comment for ModuleVariableNode a bit. 😅

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#6218 (which is based on this branch) clarifies this situation a bit by creating a specific TypeTrackingNode class that encompasses both LocalSourceNode and ModuleVariableNode. That way, we don't have the somewhat confusing inclusion of ModuleVariableNode as a LocalSourceNode.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this PR be closed then, and we focus on the other one instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes more sense to merge this one as is (assuming you're happy with it), as the changes it makes have been tested (performance and otherwise), whereas the performance tests for the other PR are still ongoing. Also, the other changes will affect the shared type tracking library, so are perhaps a bit more controversial.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright, fine by me 👍

Copy link
Copy Markdown
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The minor thing about improved explanation of module-variables should be solved by #6218, so this one should be good to go :shipit:

@RasmusWL RasmusWL merged commit 1a59c9b into github:main Jul 13, 2021
@tausbn tausbn deleted the python-ensmallen-localsourcenode branch September 14, 2022 13:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no-change-note-required This PR does not need a change note Python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants